In this short note we discuss customer targeting through telemarketing phone calls that sell long-term deposits. More specifically, within a campaign, human agents execute phone calls to a list of clients to sell the deposit (outbound clients) or, if meanwhile a client calls the contact center for any other reason, he or she is asked to subscribe to the deposit (inbound clients). Thus, the result is binary: the client either subscribes to a term deposit ('yes') or does not ('no').
The data set we use is provided by the UCI ML Repository. It has 41188 examples of customer (and prospect) responses in this telemarketing campaign, together with 20 other attributes. These attributes describe the contacts' personal characteristics (e.g. age, type of job, marital status, educational level), their credit and loan data (e.g. credit in default, existence of a housing and/or personal loan), details concerning their behavior during the telemarketing campaign (e.g. number of contacts performed, number of days that passed after the client was last contacted, number of contacts performed before this campaign) and some important socioeconomic indicators (e.g. CPI, CCI, euribor 3 month rate). The response data are ordered by date (from May 2008 to November 2010) and are very close to the data analyzed in [Moro et al., 2014].
[Moro et al., 2014] S. Moro, P. Cortez and P. Rita. A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014
The dataset is available from the UCI ML Repository: https://archive.ics.uci.edu/ml/datasets/Bank+Marketing
In [1]:
import graphlab as gl
import pandas as pd
from datetime import datetime
from sklearn.cross_validation import StratifiedKFold
In [2]:
## load data set from a locally saved csv file
bank_marketing = gl.SFrame.read_csv('./../../../04.UCI.ML.REPO/Bank_Marketing/bank-additional/bank-additional-full.csv',
delimiter=';')
## other methods of loading data sets...
# data = gl.SFrame('s3://' or 'hdfs://')
# data = gl.SFrame.from_rdd(...)  # from a pySpark RDD or SchemaRDD / Spark DataFrame
# data = gl.SFrame.read_json('')
# With a DB: configure ODBC manager / driver on the machine
# data = gl.connect_odbc?
# data = gl.from_sql?
In [3]:
bank_marketing.head()
Out[3]:
The original dataset has the following attribute information:
Field Num | Field Name | Description |
---|---|---|
1 | age | (numeric) |
2 | job | type of job (categorical: 'admin.', 'blue-collar', 'entrepreneur', 'housemaid', 'management', 'retired', 'self-employed', 'services', 'student', 'technician', 'unemployed', 'unknown') |
3 | marital | marital status (categorical: 'divorced', 'married', 'single', 'unknown'; note: 'divorced' means divorced or widowed) |
4 | education | (categorical: 'basic.4y', 'basic.6y', 'basic.9y', 'high.school', 'illiterate', 'professional.course', 'university.degree', 'unknown') |
5 | default | has credit in default? (categorical: 'no', 'yes', 'unknown') |
6 | housing | has housing loan? (categorical: 'no', 'yes', 'unknown') |
7 | loan | has personal loan? (categorical: 'no', 'yes', 'unknown') |
8 | contact | contact communication type (categorical: 'cellular', 'telephone') |
9 | month | last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec') |
10 | day_of_week | last contact day of the week (categorical: 'mon', 'tue', 'wed', 'thu', 'fri') |
11 | duration | last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model. |
12 | campaign | number of contacts performed during this campaign and for this client (numeric, includes last contact) |
13 | pdays | number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted) |
14 | previous | number of contacts performed before this campaign and for this client (numeric) |
15 | poutcome | outcome of the previous marketing campaign (categorical: 'failure', 'nonexistent', 'success') |
16 | emp.var.rate | employment variation rate - quarterly indicator (numeric) |
17 | cons.price.idx | consumer price index - monthly indicator (numeric) |
18 | cons.conf.idx | consumer confidence index - monthly indicator (numeric) |
19 | euribor3m | euribor 3 month rate - daily indicator (numeric) |
20 | nr.employed | number of employees - quarterly indicator (numeric) |
21 | y | has the client subscribed a term deposit? (binary: 'yes', 'no') [outcome of the marketing campaign] |
As shown below, there are no undefined values in any of the provided records, nor does any attribute take a strange set of values or show outliers beyond the expected ranges.
In [4]:
gl.canvas.set_target('ipynb')
bank_marketing.show()
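As a programmatic cross-check of the visual summary above (a minimal sketch; SArray.num_missing() counts the None values of a GLC column):
## hypothetical programmatic check, complementing the .show() canvas above
for col in bank_marketing.column_names():
    n_missing = bank_marketing[col].num_missing()
    if n_missing > 0:
        print '%s: %d missing values' % (col, n_missing)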
It is also important to note that the original data set has many more prospects (36548) than existing customers (4640). However, it may be a bad idea to make a stratified split over this data set, since that way we would lose the time dimension of the problem. In order to check whether the time dimension matters for this problem and the records provided, we first need to re-create the missing calendar dates and transform the original data set into a time-series object.
In the few lines of code below, we define some helper functions which translate the month and day-of-week names of each record into numbers, re-create the missing calendar dates (the new date column) from the year, month and day of week of each provided record line, and produce the corresponding datetimes.
In [5]:
from helper_functions import *

def _month_to_number(x):
    ## map a month name to its two-digit number, e.g. 'may' -> '05'
    from dateutil import parser
    return parser.parse(x).strftime('%m')

def _wkday_to_number(x):
    ## map a weekday name to its number, e.g. 'mon' -> '1'
    from dateutil import parser
    return parser.parse(x).strftime('%w')

def _str_to_datetime(x):
    ## normalize a date string to the 'YYYY-MM-DD' format
    from dateutil import parser
    return parser.parse(x).strftime('%Y-%m-%d')

def _unix_timestamp_to_datetime(x):
    ## parse a 'YYYY-MM-DD' string into a datetime.datetime object
    from dateutil import parser
    return parser.parse(x)
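The helpers add_running_year and add_running_date used in the next cell come from the accompanying helper_functions module, which is not shown here. A plausible sketch of the first one, assuming only that the records are date-ordered, so the running year advances whenever the month number drops (e.g. from '12' back to '01'):
## hypothetical re-implementation; the real add_running_year lives in helper_functions.py
def add_running_year_sketch(month_nr, start_year):
    years, year, prev = [], start_year, None
    for m in month_nr:            # month_nr holds zero-padded strings, e.g. '05'
        if prev is not None and m < prev:
            year += 1             # month number dropped => a new year has started
        years.append(year)
        prev = m
    return gl.SArray(years)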
In [6]:
## encode the campaign outcome as binary: 1 = 'yes' (subscribed), 0 = 'no'
bank_marketing['y'] = bank_marketing['y'].apply(lambda x: 1 if x=='yes' else 0)
## translate the month / day-of-week names into numbers
bank_marketing['month_nr'] = bank_marketing['month'].apply(_month_to_number)
bank_marketing['wkday_nr'] = bank_marketing['day_of_week'].apply(_wkday_to_number)
## re-create the running year (the records are date-ordered, starting in May 2008)
bank_marketing['year'] = add_running_year(bank_marketing['month_nr'], 2008)
## re-create a day of month and assemble the full 'YYYY-MM-DD' date as a datetime
bank_marketing['date'] = add_running_date(bank_marketing, 'year', 'month_nr', 'wkday_nr')
bank_marketing['date'] = bank_marketing.apply(lambda row: '-'.join(map(str,(row['year'], row['month_nr'], row['date']))))
bank_marketing['date'] = bank_marketing['date'].apply(_str_to_datetime)
bank_marketing['date'] = bank_marketing['date'].apply(_unix_timestamp_to_datetime)
In [7]:
bank_marketing
Out[7]:
In [9]:
bank_marketing = gl.TimeSeries(bank_marketing, index='date')
The provided data set, bank_marketing, has 41188 record lines describing various customer and prospect attributes, as well as their response in the telemarketing campaign of interest. The percentage of unique calendar dates across these records is low, whereas many more people seem to respond positively as time goes by. However, some months are missing from the data set, so adding a time dimension to this problem cannot help provide better predictions.
In [10]:
print 'Number of record lines [bank_marketing]: %d' % len(bank_marketing)
print 'Unique calendar dates across data set [bank_marketing]: %d' % len(bank_marketing['date'].unique())
unique_dates_pct = (len(bank_marketing['date'].unique())*100/float(len(bank_marketing)))
print 'Percentage of unique calendar dates across data set [bank_marketing]: %.2f%%'% unique_dates_pct
In [11]:
bank_marketing.filter_by(2008,'year')['month_nr'].unique().sort()
Out[11]:
In [12]:
print 'Full Data Set [year: 2008]:'
print '------------------------------'
bank_marketing_2008 = bank_marketing.filter_by(2008,'year')
customers = len(bank_marketing_2008[bank_marketing_2008['y']==1])
prospects = len(bank_marketing_2008[bank_marketing_2008['y']==0])
print 'Number of examples in year segment [bank_marketing]: %d' % len(bank_marketing_2008)
print 'Number of existent customers: %d (%.2f%%)' % (customers, 100*customers/float(len(bank_marketing_2008)))
print 'Number of prospects: %d (%.2f%%)\n' % (prospects, 100*prospects/float(len(bank_marketing_2008)))
In [13]:
bank_marketing.filter_by(2009,'year')['month_nr'].unique().sort()
Out[13]:
In [14]:
print 'Full Data Set [year: 2009]:'
print '------------------------------'
bank_marketing_2009 = bank_marketing.filter_by(2009,'year')
customers = len(bank_marketing_2009[bank_marketing_2009['y']==1])
prospects = len(bank_marketing_2009[bank_marketing_2009['y']==0])
print 'Number of examples in year segment [bank_marketing]: %d' % len(bank_marketing_2009)
print 'Number of existent customers: %d (%.2f%%)' % (customers, 100*customers/float(len(bank_marketing_2009)))
print 'Number of prospects: %d (%.2f%%)\n' % (prospects, 100*prospects/float(len(bank_marketing_2009)))
In [15]:
bank_marketing.filter_by(2010,'year')['month_nr'].unique().sort()
Out[15]:
In [16]:
print 'Full Data Set [year: 2010]:'
print '------------------------------'
bank_marketing_2010 = bank_marketing.filter_by(2010,'year')
customers = len(bank_marketing_2010[bank_marketing_2010['y']==1])
prospects = len(bank_marketing_2010[bank_marketing_2010['y']==0])
print 'Number of examples in year segment [bank_marketing]: %d' % len(bank_marketing_2010)
print 'Number of existent customers: %d (%.2f%%)' % (customers, 100*customers/float(len(bank_marketing_2010)))
print 'Number of prospects: %d (%.2f%%)' % (prospects, 100*prospects/float(len(bank_marketing_2010)))
In order to evaluate our learning algorithms later, we need to make a train/test split of the bank_marketing SFrame. However, due to the class imbalance observed in the contacts' response (there are many more prospects than existing customers), we had better do so in a stratified way.
In [17]:
## remove the time dimension of the problem and
## transform the TimeSeries object into a NumPy array
bank_marketing = bank_marketing.to_sframe().remove_column('date')
features = bank_marketing.column_names()
bank_marketing_np = bank_marketing.to_numpy()
## provide the stratified train/test split (two folds of equal size)
skf = StratifiedKFold(bank_marketing['y'], n_folds=2, shuffle=True, random_state=1)
for train_idx, test_idx in skf:
    train, test = bank_marketing_np[train_idx], bank_marketing_np[test_idx]
train = pd.DataFrame(train, index=train_idx, columns=features)
train = gl.SFrame(train, format='dataframe')
test = pd.DataFrame(test, index=test_idx, columns=features)
test = gl.SFrame(test, format='dataframe')
## restore the original dtypes (the NumPy round-trip casts everything to object)
for attrib in features:
    train[attrib] = train[attrib].astype(bank_marketing[attrib].dtype())
    test[attrib] = test[attrib].astype(bank_marketing[attrib].dtype())
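For reference, the sklearn.cross_validation module used above was deprecated and later removed; on scikit-learn 0.18+ an equivalent stratified 50/50 split can be obtained with train_test_split (a sketch, not the code used in this notebook):
## equivalent stratified split on scikit-learn >= 0.18 (hypothetical alternative)
from sklearn.model_selection import train_test_split
y = list(bank_marketing['y'])
train_idx, test_idx = train_test_split(range(len(y)), test_size=0.5,
                                       stratify=y, random_state=1)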
In [18]:
print 'Training Data Set:'
print '---------------------'
train_customers = len(train[train['y']==1])
train_prospects = len(train[train['y']==0])
print 'Number of examples in training set [train]: %d' % len(train)
print 'Number of existent customers: %d (%.2f%%)' % (train_customers, 100*train_customers/float(len(train)))
print 'Number of prospects: %d (%.2f%%)\n' % (train_prospects, 100*train_prospects/float(len(train)))
print 'Test Data Set:'
print '-----------------'
test_customers = len(test[test['y']==1])
test_prospects = len(test[test['y']==0])
print 'Number of examples in validation set [test]: %d' % len(test)
print 'Number of existent customers: %d (%.2f%%)' % (test_customers, 100*test_customers/float(len(test)))
print 'Number of prospects: %d (%.2f%%)' % (test_prospects, 100*test_prospects/float(len(test)))
Before we start, let's assume that each phone call to a contact costs 1 USD and that the customer lifetime value of a contact who purchases a term deposit is 100 USD. Then the ROI for calling all the contacts in our test data set is:
In [19]:
def calc_call_roi(contact_list, lead_score, pct_tocall):
    ## assumptions: each call costs 1 USD; a subscriber is worth 100 USD over his/her lifetime
    cost_ofcall = 1.00
    cust_ltv = 100.00  # customer lifetime value
    num_calls = int(len(contact_list) * pct_tocall)
    ## attach the lead scores and call the top-scored contacts first
    if 'lead_score' in contact_list.column_names():
        contact_list.remove_column('lead_score')
    contact_list = contact_list.add_column(lead_score, name='lead_score')
    sorted_bymodel = contact_list.sort('lead_score', ascending=False)
    call_list = sorted_bymodel[:num_calls]
    num_subscriptions = len(call_list[call_list['y']==1])
    ## ROI: net gain over the money spent on calls
    roi = (num_subscriptions * cust_ltv - num_calls * cost_ofcall) / float(num_calls * cost_ofcall)
    return roi
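As a quick sanity check of the formula (a hypothetical worked example, reusing the assumed 1 USD call cost and 100 USD lifetime value): if roughly 11.3% of the called contacts subscribe, every dollar spent on calls returns about 10.3 dollars net.
## hypothetical worked example of the ROI formula above (numbers are illustrative)
num_calls = 1000                       # call 1000 contacts at 1 USD each
num_subscriptions = 113                # ~11.3% subscribe, each worth 100 USD
roi = (num_subscriptions * 100.00 - num_calls * 1.00) / float(num_calls * 1.00)
print 'ROI: %.2f' % roi                # 10.30, i.e. ~10.3 USD net per USD spent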
In [20]:
init_leadscores = gl.SArray([1 for _ in test])
init_roi = calc_call_roi(test, init_leadscores, 1)
print 'ROI for calling all contacts [test]: %.2f%%' % init_roi
Middle-aged, employed people with good annual earnings are usually much better prospects to contact and keep informed about a new product. Indeed, as shown below, 11.27% of the contacts in the training set opened a long-term deposit account, and 43.06% of those who did were employed (not students or unemployed) and younger than 38 (the median age).
In [22]:
num_customers = float(len(train))
numY = gl.Sketch(train['y']).frequency_count(1)
print "%.2f%% of contacts in training set opened long-term deposit accounts." % (numY/num_customers * 100.0)
median_age = gl.Sketch(train['age']).quantile(0.5)
num_purchasing_emp_under_median_age = sum(train.apply(lambda x: 1 if x['age'] < median_age
                                                      and x['job'] not in ('unemployed', 'student', 'unknown')
                                                      and x['y'] == 1 else 0))
probY_emp_under_median_age = (num_purchasing_emp_under_median_age / float(numY)) * 100.0
print "%.2f%% of the clients who opened long-term deposit accounts were employed (but not students) and had age < %d (median)." % (probY_emp_under_median_age, median_age)
In [24]:
target_leadscore = test.apply(lambda x: 1 if x['age'] < median_age
                              and x['job'] not in ('unemployed', 'student', 'unknown')
                              and x['y'] == 1 else 0)
age_targeting_roi = calc_call_roi(test, target_leadscore, 0.2)
print 'ROI for targeted calls [employed (not students) and age < %d (median)] to 20%% of contacts: %.2f%%' % (median_age, age_targeting_roi)
Result:
- ROI for targeted calls to 20% of contacts (employed, not students, age < 38): 28.75%
- This is a major improvement over the ROI achieved by calling everyone in the list.
Dato's classifier toolkit can choose the most effective classifier model automatically.
In [23]:
## remove features that introduce noise in ML prediction
features = train.column_names()
features.remove('duration')
features.remove('y')
features.remove('month_nr')
features.remove('wkday_nr')
features.remove('year')
## GLC AutoML Classifier
toolkit_model = gl.classifier.create(train, features=features, target='y')
The toolkit automatically evaluates several types of algorithms, including boosted trees, random forests, decision trees, support vector machines and logistic regression, with intelligent default parameters. Based on a validation set, it chooses the most accurate model, which in our case is a boosted trees classifier. We can then evaluate this model on the test data set.
In [24]:
results = toolkit_model.evaluate(test)
print "accuracy: %.5f, precision: %.5f, recall: %.5f" % (results['accuracy'], results['precision'], results['recall'])
This initial model can be considered accurate, given that it correctly predicts the purchasing decisions of ~90% of the contacts. However, the toolkit_model leaves room for improvement. Specifically, only ~66% of predicted sales actually convert to sales (precision), and only ~24% of actual sales were predicted by the model (recall). In order to better understand the model, we can review the importance of the input features.
In [25]:
toolkit_model.get_feature_importance()
Out[25]:
In [26]:
toolkit_leadscore = toolkit_model.predict(test,output_type='probability')
toolkit_roi = calc_call_roi(test, toolkit_leadscore, 0.2 )
print 'ROI for calling 20%% of highest predicted contacts: %.2f%%' % toolkit_roi
The scatter plot diagrams among the various int and float variables of the train data set, and more specifically the four plots below, suggest investigating whether interactions exist between the following features:
qfeatures0 = ['emp.var.rate','cons.price.idx','cons.conf.idx','euribor3m']
In [26]:
import matplotlib.pyplot as plt
%matplotlib inline
qfeatures0 = ['emp.var.rate','cons.price.idx','cons.conf.idx','euribor3m']
plt.figure(figsize=(10,10))
subplot_idx = 1
## draw the first four pairwise scatter plots among the four indicators
for attrib1 in qfeatures0:
    for attrib2 in qfeatures0:
        if attrib2 != attrib1:
            if subplot_idx < 5:
                plt.subplot(2, 2, subplot_idx)
                plt.scatter(train[attrib1], train[attrib2])
                plt.xlabel(attrib1)
                plt.ylabel(attrib2)
                plt.title('\'%s\' vs \'%s\'' % (attrib1, attrib2))
                subplot_idx += 1
plt.show()
Next we add quadratic interactions between the four features below:
qfeatures0 = ['emp.var.rate','cons.price.idx','cons.conf.idx','euribor3m']
In [27]:
## define a quadratic transformer object
quadratic_transformer = gl.feature_engineering.QuadraticFeatures(features=qfeatures0)
## fit the quadratic transformer object over the train set
quadratic = gl.feature_engineering.create(train, quadratic_transformer)
## transform the train data set
qtrain = quadratic.transform(train)
## remove the features that may worsen our predictions
qfeatures = qtrain.column_names()
qfeatures.remove('duration')
qfeatures.remove('y')
qfeatures.remove('month_nr')
qfeatures.remove('wkday_nr')
qfeatures.remove('year')
In [28]:
qtrain.head(5)
Out[28]:
Next, we re-train the GraphLab Create AutoML Classifier on this new data set, qtrain.
In [38]:
new_toolkit_model = gl.classifier.create(qtrain, target='y', features=qfeatures)
Next, we evaluate the new AutoML Classifier, new_toolkit_model, on the test data set.
In [39]:
results = new_toolkit_model.evaluate(quadratic.transform(test))
print "accuracy: %.5f, precision: %.5f, recall: %.5f" % (results['accuracy'], results['precision'], results['recall'])
Note that this model is almost as accurate as the previous one, with similar precision (~66% of the predicted sales were actually converted to sales) and recall (~24% of actual sales were predicted by the model). However, to get a better feeling for the model just trained (new_toolkit_model) and how it differs from the previous one (toolkit_model), we can review the importance of the input features in the two cases.
In [40]:
print '\'new_toolkit_model\'\n[GLC AutoML Classifier with quadratic interactions]:\n'
print new_toolkit_model.get_feature_importance()
In [41]:
print '\'toolkit_model\'\n[GLC AutoML Classifier without quadratic interactions]:\n'
print toolkit_model.get_feature_importance()
By comparing these two models we note that:
- The quadratic interactions emp.var.rate ∗ euribor3m and cons.conf.idx ∗ euribor3m seem to be important if we want to build a more accurate model, and they should not be neglected.
- The age, euribor3m and campaign features are significant in both cases.
- The number of days that passed after the client was last contacted (pdays), the number of contacts performed before this campaign and for this client (previous), as well as the channel through which the contact was made (contact: 'telephone'), are important for both models.
- The specific day_of_week ('mon') on which the contact was made seems to be important for both models.
In [42]:
## show ROI for experimentation model
newtoolkit_leadscore = new_toolkit_model.predict(quadratic.transform(test),output_type='probability')
newtoolkit_roi = calc_call_roi(quadratic.transform(test), newtoolkit_leadscore, 0.2)
print 'ROI for calling predicted contacts: %.2f%%' % newtoolkit_roi
Age grouping, hyperparameters fine-tuning

As is obvious from the histogram below, most of our contacts were between 30 and 36 years old; 4518 contacts were between 36 and 43, 2931 between 43 and 50, and about 2500 in each of the 23 to 30 and 50 to 56 age groups. The remaining contacts were either younger or older than these ages, but with certainly no more than 1000 cases or so in any specific age group. Therefore, grouping the age values of our contacts into a pre-defined number of bins may be beneficial for the learning algorithm of choice, and may improve the ROI of our telemarketing campaign even further.
In [35]:
qtrain['age'].show()
To group the age values of our contacts, we leverage the FeatureBinner method of the feature_engineering toolkit of GraphLab Create, as shown below.
In [45]:
## define a binning transformer for the age attribute of contacts
age_binning_transformer = gl.feature_engineering.FeatureBinner(features='age', strategy='quantile', num_bins=12)
## fit the age binning transformer over the train set
age_binning = gl.feature_engineering.create(train, age_binning_transformer)
## transform the train data set
qtrain1 = age_binning.transform(qtrain)
## remove the features that may worsen our predictions
qfeatures1 = qtrain1.column_names()
qfeatures1.remove('duration')
qfeatures1.remove('y')
qfeatures1.remove('month_nr')
qfeatures1.remove('wkday_nr')
qfeatures1.remove('year')
In [46]:
qtrain1['age'].show()
Let's now train a boosted trees classifier model using this enriched data set, qtrain1. We also tweak its parameters to achieve better predictive performance.
In [148]:
## We create a boosted trees classifier with the enriched dataset.
new_boostedtrees_model = gl.boosted_trees_classifier.create(qtrain1, target='y', features = qfeatures1,
max_iterations = 100,
max_depth=5,
step_size=0.1,
min_child_weight=0.06,
random_seed=1,
early_stopping_rounds=10)
Next we evaluate the new_boostedtrees_model on the test data set.
In [159]:
results = new_boostedtrees_model.evaluate(age_binning.transform(quadratic.transform(test)))
print "accuracy: %.5f, precision: %.5f, recall: %.5f" % (results['accuracy'], results['precision'], results['recall'])
This new model (new_boostedtrees_model) is almost as accurate as the previous one, has higher precision (~66% of the predicted sales were actually converted to sales) and similar recall (~23% of actual sales were predicted by the model). To get a better feeling for the model just trained (new_boostedtrees_model) and how it differs from the previous one (new_toolkit_model), we can review the importance of the input features in the two cases.
In [161]:
print '\'new_boostedtrees_model\'\n[GLC Boosted Trees Classifier with quadratic interactions, \
age grouping & hyperparams tuned]:\n'
new_boostedtrees_model.get_feature_importance().print_rows(num_rows=20)
In [46]:
print '\'new_toolkit_model\'\n[GLC AutoML Classifier with quadratic interactions]:\n'
print new_toolkit_model.get_feature_importance()
By comparing these two cases, we note that:
- The quadratic interactions cons.price.idx ∗ euribor3m and cons.price.idx ∗ emp.var.rate enter the new_boostedtrees_model as significant attributes.
- The euribor3m and campaign features are significant in both cases, whereas age is far less important in this new tweaked model.
- The number of days that passed after the client was last contacted (pdays), the number of contacts performed before this campaign and for this client (previous), the channel through which the contact was made (contact: 'telephone'), as well as the specific day_of_week ('mon') on which the contact occurred, seem important for both models.
- In the new_boostedtrees_model, the attributes describing whether the contact has a personal loan (loan) or credit in default (default), his/her educational (education) and employment status (job), as well as the outcome of the previous marketing campaign (poutcome), get greater importance than before.
In [176]:
## show ROI for experimentation model
test1 = age_binning.transform(quadratic.transform(test))
boostedtrees_leadscore = new_boostedtrees_model.predict(test1, output_type='probability')
boostedtrees_roi = calc_call_roi(test1, boostedtrees_leadscore, 0.2)
print 'ROI for calling predicted contacts: %.2f%%' % boostedtrees_roi
Conclusion:
We will choose the new_boostedtrees_model to lead score the contact list. By doing so, we can achieve a 35.13% ROI by contacting only 20% of the people in this list. Assuming we had the budget and time to contact everyone on the list, we would have an ROI of only 10.27%, a fact that emphasizes the importance of lead scoring as a method. Furthermore, the ROI achieved by our tuned new_boostedtrees_model is significantly greater than the 28.75% ROI returned by simply targeting employed, middle-aged people.
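For a quick side-by-side recap (a sketch, assuming the ROI variables computed earlier in this notebook are still in scope):
## hypothetical recap of the ROI figures computed above
for label, roi in [('call everyone', init_roi),
                   ('rule-based targeting, 20% called', age_targeting_roi),
                   ('AutoML lead scoring, 20% called', toolkit_roi),
                   ('tuned boosted trees lead scoring, 20% called', boostedtrees_roi)]:
    print '%-45s %.2f%%' % (label + ':', roi)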
Assuming we have the time and resources to call 20% of the lead-scored contact list, these are the first people we should call:
In [183]:
pct_tocall = 0.2
boostedtrees_list = test1.sort('lead_score', ascending=False)
num_calls = int(len(boostedtrees_list) * pct_tocall)
print 'Assuming we have time and resources to call %d%% of the lead scored contact list, we \
need to make %d phone calls.\n' % (pct_tocall*100, num_calls)
print 'Lead Scored Contact List:'
boostedtrees_list[['lead_score', 'age', 'campaign', 'euribor3m', 'job', 'loan', 'default', 'poutcome']].\
print_rows(num_rows=50, max_row_width=100)
Note that, quite differently from our informed decision of Part 1, this new tweaked new_boostedtrees_model predicts high lead scores for students and retired people who have no personal loan or credit in default. As a result, the model suggests including them in our target contacts.